Processing extensive and diverse data in real-time is a significant challenge in the context of
smart cities. Timely access to information and efficient analytics is essential for smart city
services to make data-driven decisions and enhance urban living. Scheduling algorithms play a
crucial role in ensuring the prompt delivery of services and efficient task completion. This paper
explores various scheduling techniques, including static, dynamic, and hybrid schedulers, and
compares their objectives and performance. Additionally, the study examines two prominent data
processing frameworks, Hadoop and Spark, and compares their capabilities in handling big data in
smart cities. With its ability to process large amounts of data quickly and efficiently, Spark has
shown superiority over Hadoop in real-time data processing and performance optimization. The paper
concludes by highlighting the strengths and limitations of each framework. It discusses the need
for further research in optimizing scheduling techniques and exploring hybrid artificial
intelligence scheduling for Spark. Overall, the findings contribute to a better understanding of
data processing in real-time and provide insights for researchers and practitioners in smart cities.

Keywords: Big data, Hadoop, Scheduling, Smart city, Spark
1. INTRODUCTION

Data growth is reaching new highs, and recent statistics indicate that by the year 2050 almost 70%
of the world population will live in cities. For this reason, the development of smart cities is
necessary: a well-built smart infrastructure makes it possible to provide smart, efficient, and
enhanced solutions. As towns are converted into smart domains and other forms of modern technology
emerge, the smart city (SC) is gaining much attention; it is now being seen as a new paradigm of
intelligent city development and sustainable socio-economic growth [1], [2]. To enhance the quality
of life, the smart city proposes a novel approach to the design and operation of urban
infrastructure, including infrastructure for housing, transportation, public services, utilities,
health care, and more.

Smart cities are those in which investments in human capital and in information and communication
technology lead to long-term economic development and a good quality of life [3]. Cities are
therefore necessary for tackling significant public and financial challenges, such as low-carbon
expansion, emission reduction, energy efficiency, shared energy resources, economic development,
and more [4]. The reason behind moving to smart cities is that they can provide services on a
citizen-demand basis, so that organizations and businesses respond better to citizens' needs. Two
basic requirements for such personalized services are the ability to understand the user's current
needs and the ability to adapt later to changes in the user's behavior. Both require data analysis,
and where and when this analysis takes place is crucial. This smart setup becomes a reality through
the appropriate use of internet-embedded devices, which include sensors and electronics capable of
communicating with each other via a network. However, these devices generate a massive amount of
heterogeneous data, known as big data [5]-[7].

The number of data-producing devices in smart cities has expanded dramatically throughout the
globe, and smart cities are now one of the primary sources of data [8]. The world's information
output has skyrocketed, leading to a phenomenon known as big data. Big data is a term used to
describe very massive and complicated datasets that cannot be processed using conventional
methods [9]. Google launched MapReduce, one of the practical frameworks for processing massive
data, in 2004 [10], [11]. It is scalable, dependable, and has excellent fault tolerance. In
addition, Apache Hadoop is a free, open-source software framework. It has dominated big data
analysis because it uses all available hardware resources, scaling from a single server to
thousands of servers, processes huge amounts of data in parallel, tolerates faults, and balances
network load. Companies such as Google, Facebook, and Amazon have a vast amount of data that
requires processing to filter out valuable information. Handling the massive amount of data from
smart cities is a prominent topic in current computing. The conventional boundaries of the smart
city have expanded: new technology now allows emergency events to be predicted and managed in
real-time, both of which were previously impossible to achieve. Because competent resources are so
crucial in the aftermath of an incident, the effectiveness with which they are allocated and
scheduled is a critical indicator of any response capability [12], [13]. Many researchers are
working to find ways to handle this big data efficiently.

The main contribution of this paper is an examination of scheduling techniques in Hadoop and Spark
that may be applied in a smart city environment. This review fulfills the following objectives:
i) provide an overview of smart cities, including their significance and benefits; ii) discuss and
analyze the challenges of processing the massive amounts of data smart cities generate; iii) compare
in detail the Hadoop and Spark scheduling techniques for big data analysis; and iv) identify
research gaps in current data processing techniques, future research directions, and open research
issues in scheduling techniques for real-time big data processing. The scientific significance of
this review is that it will help researchers understand the need to develop algorithms and
techniques that support the prosperity of smart cities and similar systems, eventually leading to
the betterment of humanity.

The rest of the paper is organized as follows. Section 2 explains the smart city and some of its
characteristics. Section 3 presents techniques for processing data in real-time. Spark and its
comparison with Hadoop are discussed in section 4. Finally, section 5 concludes the paper with
future recommendations.


2. SMART CITY 

A smart city should be able to optimize the utilization of all of its assets in real-time, both the
material (such as transportation systems, energy distribution networks, and natural resources) and
the immaterial (such as human capital, the intellectual capital of companies, and organizational
capital in public administration bodies). Flood, fire, and earthquake emergency rescue and disaster
relief, anti-terrorism, and remote control of hazardous areas are some of the many potential
uses [14]. In contrast to renewable resources (such as solar, wind, and geothermal energy),
nonrenewable resources (such as petroleum) will be depleted over time. In recent decades, experts
have promoted the ideas of smart energy [15], green energy [16], and sustainable energy [17] to
raise awareness of challenges and develop the best energy usage practices. A smart city has several
characteristics, including the transfer of technological, infrastructural, and managerial
procedures from rural to urban settings.


2.1. Characteristics of smart cities 

Specific characteristics, keynotes, and organizational frameworks characterize smart cities; the idea 
behind this theme is the foundation of a modern, technologically advanced metropolis. A few of the smart 
city services are given in Figure 1. The figure highlights various features of a smart city, including the 
education system, health system, daily utility management, smart transportation, government sector, and 
public sector. The explanations below elaborate on these features, showcasing how technology and data- 
driven solutions enhance urban life. 

Figure 1. Smart city services


2.1.1. Smart energy (sustainable resources) 

The concept of sustainability has maintained its prominence throughout the development of the 
smart city [18]-[20]. Preserving energy and natural resources is critical for a smart city to function 
sustainably [21]-[23]. In the early days of the smart city movement, enhancing residents’ comfort was a 
primary focus. To address these issues, various cities throughout the world ran trials. Intelligent lighting has 
been the attention of specific studies. Citizens may adjust the brightness of the ten thousand sensor-equipped 
streetlights to suit their needs. The goal is to reduce power consumption by approximately 70% [24]. 

Smart energy is increasingly appealing because it promotes an all-encompassing approach to
coordinating green energy, sustainable energy, and renewable energy. The goal of green energy is to
use fuel with minimal environmental impact. An alternative energy source that does not deplete the
planet's resources over time is the best option for meeting the world's energy needs. Increased
focus on energy needs has led to a rise in the popularity of renewable energy sources, and much
research is under way on integrating them into intelligent buildings. Smart buildings may use
renewable energy directly, or renewable energy plants can be incorporated into the existing
infrastructure. One proposal is a microgrid control framework that integrates a photovoltaic (PV)
power source with a significant energy storage unit [25]. Similarly, Jia et al. [26] propose
combining solar and wind power to decrease the dependency on critical energy resources.


2.1.2. Smart transportation 

Accessibility at regional and international levels and the availability of cutting-edge,
environmentally friendly transportation technologies all fall under the term smart
transportation [27], [28]. The need for reliable modes of transportation dates back to the dawn of
civilization. As technology has progressed, all modes of transportation, including land, sea, rail,
and air, must follow the same stipulation. In traditional transportation, neither the overall
system nor its components were interconnected. A linked system built on connected everyday devices
has replaced the conventional transportation system. Modern automobiles are therefore part of
various communication and routing frameworks, and all the automobiles that participate in a
particular transmission are linked together. Increasing the connections inside a single transporter
allows several standalone transporters to be connected into a global transportation system.
Intelligent transportation systems (ITS) have given much thought to the vehicular ad hoc network
(VANET) [29]. VANET has widely used vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I)
communication capabilities to manage rural traffic. Using the new transportation framework metrics
to ensure the metropolitan area's viability comes at the expense of the residents' happiness [30].


2.1.3. Smart healthcare 

The present healthcare system is struggling to keep up with the demands of a rapidly expanding
population. The issue worsens because medical staff numbers have not increased with population
growth. As a result, the gap between healthcare expectations and delivery widens due to a lack of
resources and high demand. To meet the need and improve the quality of service, current smart
healthcare services use sensor networks, ICT, distributed computing, fog computing, smartphone
applications, and large-scale data processing systems [31]. Integrating electronic clinical records
(ECRs) further allows for timely decisions with the most up-to-date information [32]. Another
method of achieving satisfactory mobile healthcare in metropolitan areas was given in [33].
2.1.4. Waste management

Rapid urbanization and increased manufacturing have contributed to a rise in waste production.
Effective waste management is possible via the cooperation of the workforce, municipal authorities,
and private businesses [34]. Waste management has four main phases: waste collection, waste
removal, waste reuse, and waste recovery. Poorly managed waste creates problems for human health
and the environment [35], making waste management essential for the economic development of smart
urban areas.


3. THE PROCESSING OF DATA IN REAL-TIME 

The problem of processing extensive data becomes increasingly difficult as data volume and
diversity both rise. Efficient analytics requires access to the information within a bounded time
frame. For instance, real-time data processing is essential in a traffic monitoring system that
constantly tracks millions of cars. This processing helps in locating alternative routes and
calculating arrival times. Timeliness is of the utmost significance in this context since a mistake
or delay might result in the misrouting of an ambulance, putting lives at risk. With more and more
people needing access to decision-making tools in real-time, timeliness has emerged as a crucial
indicator of data quality. Handling massive amounts of data within the available time is therefore
vital. The timely nature of big data also aids in analyzing event streams to enable real-time
decision-making. Consequently, the diverse data sets provided by many data sources must be
integrated into a unified analytical platform to minimize potential delays in real-time
processing [36]-[39]. The flowchart of data processing in real-time is given in Figure 2. It begins
with real-time data collection from various sources in the city. A framework is then selected to
handle the collected data efficiently. Big data analysis is conducted to derive insights and
identify patterns. Finally, optimized smart city services, such as smart transportation, energy
management, waste management, and more, are implemented based on the analysis. This systematic
approach leverages technology and data to improve urban living.

Real-time data processing is essential for maximizing the effectiveness of smart city services. 
However, effective scheduling becomes crucial to guarantee the prompt delivery of services and the effective 
completion of tasks. Tasks are prioritized and efficient timetables for various processes are created using 
scheduling algorithms and policies. In complex real-time operations, scheduling is especially important for 
ensuring punctuality and meeting requirements. A schedule that meets most requirements for a particular set 
of processes is considered optimal in this context. 

Figure 2. Real-time data processing flow chart


3.1. Scheduling 

A scheduler prioritizes tasks using an algorithm or policy. The job of a scheduler is to create a
timetable for a group of processes. A process set is feasible if a timetable exists that meets its
specified requirements. Complex real-time periodic operations often need a guarantee of
punctuality. An optimum schedule is a schedule that meets most of the specified requirements for a
given set of processes. In most cases, a scheduler is optimum if it can schedule every feasible
collection of operations [40]. Scheduling algorithms can be categorized as static or
dynamic [41].


3.1.1. Static scheduler 

In static scheduling, the schedule is generated offline. All scheduling decisions, such as when to
execute each operation or send each message, are fixed in the program and stored in a table for use
at runtime, where a simple dispatcher releases jobs according to that table. Static scheduling is
sometimes known as time-triggered scheduling [42]. It is only possible with prior knowledge of how
the processes behave, so this plan can only function if all operations are genuinely periodic.
Although it demands insight into the processes' traits beforehand, the overhead it imposes during
execution is negligible. Real-time shortest job first (SJF) and rate monotonic (RM) are appropriate
algorithms for static process scheduling. In both algorithms, priority is allocated depending on
the deadline and time required to finish the task [43].
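
The table-driven idea described above can be illustrated with a minimal sketch. It assumes integer
time units, periodic tasks whose deadlines equal their periods, and rate-monotonic ordering (shorter
period means higher priority); the task names and the functions build_static_table and dispatch are
hypothetical and only serve the illustration.

```python
from functools import reduce
from math import gcd


def build_static_table(tasks):
    """Build an offline schedule for one hyperperiod.

    tasks maps a task name to (period, execution_time) in integer time units.
    Returns a list with one entry per time unit: a task name, or None when idle.
    """
    # The hyperperiod is the least common multiple of all periods.
    hyper = reduce(lambda a, b: a * b // gcd(a, b), (p for p, _ in tasks.values()))
    # Rate-monotonic order: the task with the shortest period comes first.
    order = sorted(tasks, key=lambda name: tasks[name][0])
    remaining = {name: 0 for name in tasks}
    table = []
    for t in range(hyper):
        for name, (period, exec_time) in tasks.items():
            if t % period == 0:
                # A new job is released; the previous one must be finished.
                if remaining[name] > 0:
                    raise ValueError(f"{name} misses its deadline at t={t}")
                remaining[name] = exec_time
        running = next((n for n in order if remaining[n] > 0), None)
        if running is not None:
            remaining[running] -= 1
        table.append(running)
    if any(remaining.values()):
        raise ValueError("work left unfinished at the end of the hyperperiod")
    return table


def dispatch(table, t):
    """Runtime dispatcher: a plain table lookup, with no scheduling decision."""
    return table[t % len(table)]


if __name__ == "__main__":
    demo = {"sensor": (4, 1), "traffic": (6, 2), "report": (12, 3)}
    for t, name in enumerate(build_static_table(demo)):
        print(t, name)
```

All decisions are taken in build_static_table before the system runs, which is why the runtime cost
in dispatch reduces to a single lookup.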


3.1.2. Dynamic scheduler 

On the other hand, a dynamic approach establishes schedules during execution, providing a more
adaptable system capable of handling unanticipated occurrences. It is plausible to claim that in
safety-critical systems all events should be predictable and schedulability should be established
before any action is taken, which calls for a scheduling method that is entirely unchanging across
time. Online schedulers make scheduling choices while the system is actively running, and the
priorities they use can be either static or dynamic. These choices are grounded in the past and
present state of the processes, that is, the current condition of the system. A scheduler that
additionally knows the future, such as upcoming request times, is termed clairvoyant. Two commonly
used dynamic schedulers in real-time systems are least slack time first (LST) and earliest deadline
first (EDF). In these algorithms, priority is decided based on the slack time and the deadlines of
the given processes, respectively. Both are considered more suitable for soft real-time operating
systems [43]. The objectives of a few static and dynamic scheduling algorithms are listed in
Table 1.
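
As a small illustration of how the two dynamic policies can disagree, the sketch below selects the
next job under EDF and under LST. The Job fields and function names are assumptions made for this
example; slack is taken as the absolute deadline minus the current time minus the remaining
execution time.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    deadline: int   # absolute deadline
    remaining: int  # remaining execution time


def pick_edf(ready):
    """Earliest deadline first: the job with the nearest deadline runs."""
    return min(ready, key=lambda job: job.deadline)


def slack(job, now):
    """Time the job can still wait without missing its deadline."""
    return job.deadline - now - job.remaining


def pick_lst(ready, now):
    """Least slack time first: the job with the smallest slack runs."""
    return min(ready, key=lambda job: slack(job, now))


if __name__ == "__main__":
    ready = [Job("A", deadline=10, remaining=2), Job("B", deadline=12, remaining=9)]
    print(pick_edf(ready).name)         # A: its deadline (10) is earlier
    print(pick_lst(ready, now=0).name)  # B: its slack (3) is below A's (8)
```

Both selections are recomputed whenever the ready set changes, which is what makes these policies
dynamic rather than table-driven.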


Table 1. Static and dynamic scheduler algorithms

Category  Algorithm                                             Objectives achieved
Static    Highest level first with estimated time [44]          Minimized running time; simplified the list scheduling algorithm
Static    Critical path on a processor [44]                     Limited the cost of computation and time consumed
Static    Constrained earliest finish time [45]                 Reduction in implementation time
Static    Multipriority queueing genetic [46]                   Decreased the execution time for subtasks
Static    Parallelism-based earliest finish time [47]           Reduced the finish time
Dynamic   Dynamic level scheduling [48]                         Decreased scheduled time
Dynamic   Dynamic task scheduling [49]                          Less complicated, and less time is taken to finish the tasks
Dynamic   Dynamic load balancing using genetic algorithms [50]  Optimized load balancing and processor consumption along with high speed
Dynamic   New response time bounds for fixed priority [51]      Better response time
Dynamic   Load-based schedulability [52]                        Scheduling based on priority


3.1.3. Hybrid scheduler 

Schedulers may be either preemptive or non-preemptive. In most cases, preemption happens when a
process with a higher priority becomes executable. As a result of preemption, a process might be
suspended involuntarily. Non-preemptive schedulers do not temporarily suspend running tasks;
however, this lets them manage concurrency for processes that access a resource requiring mutual
exclusion [53].

A hybrid system is also feasible. A scheduler can have a preemptive design while allowing a process
to defer preemption for a short time; the process may define a block of code that another process
cannot interrupt. For instance, a program may poll the system clock for the current time, use that
to determine how much of a delay is required, and then perform that delay. If the process could be
preempted between reading the clock and starting the delay, it would be impossible to write such
code correctly. Caution is essential when writing code that uses deferred preemption primitives.
The ensuing blocking must be bounded and small, often of the same order of magnitude as the overhead
of context switching. The computer's scheduler uses this strategy to enable a rapid context switch:
the switch may be postponed by up to 50 processor cycles; as a result, the context to be moved is
small and can be accommodated in only ten additional cycles [54].
3.2. Previous works

Numerous studies have been conducted on task scheduling, exploring various algorithms and models.
One notable study, by Liu and Layland in 1973 [55], focused on the earliest-deadline-first (EDF)
scheduling algorithm and fixed priority (FP) scheduling. They investigated these algorithms using
the ordinary periodic job model, without self-suspensions, and demonstrated that EDF is an optimum
approach to meeting deadlines. Additionally, they established that the rate-monotonic (RM)
scheduling algorithm is optimal among FP techniques.
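
For reference, the classical utilization-based conditions associated with this analysis can be
stated compactly. The notation is introduced here as an assumption: n independent, preemptable
periodic tasks on one processor, where task i has worst-case execution time C_i and a period T_i
that is also its relative deadline.

```latex
\begin{align*}
U &= \sum_{i=1}^{n} \frac{C_i}{T_i}
  && \text{(total processor utilization)} \\
U &\le n\left(2^{1/n} - 1\right)
  && \text{(sufficient for RM; the bound tends to } \ln 2 \approx 0.693 \text{ as } n \to \infty\text{)} \\
U &\le 1
  && \text{(necessary and sufficient for EDF)}
\end{align*}
```

As an illustrative case, three tasks with (C_i, T_i) = (1, 4), (2, 6), and (3, 12) give
U = 1/4 + 2/6 + 3/12 = 5/6, roughly 0.833. This exceeds the RM bound for n = 3, which is about
0.780, so the RM test is inconclusive for this set, whereas the EDF condition U <= 1 is satisfied.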

Other studies [56], [57] centered on configuring and scheduling emergency resources during fire
catastrophes and developed a dynamic model to analyze and address this critical aspect. Similarly,
[57], [58] constructed emergency resource scheduling models that consider factors such as an
arbitrary initial time for rescue operations and a fixed number of rescuers. In 2010, Sandholm and
Lai [59] proposed a dynamic proportional share scheduler. This scheduler is an enhancement to the
Hadoop schedulers that provides quality of service (QoS) to diverse users based on priority. It
allows users to choose tasks and schedule them according to their preference, and the allocated
resources can be changed as the work requirements change. The scheduler behaves as a fair scheduler
when users state no priorities or resource requirements.

In 2016, Zacheilas and Kalogeraki [60] introduced a cost-effective scheduling technique. This
strategy aims to meet financial constraints while also improving task completion time, and it is
based on a Pareto approach. The scheduler helps to decrease completion time and increase
throughput. One aspect that influences a cluster's overall performance is job response time. This
aspect inspired Zaharia et al. [61] to propose the longest approximate time to end (LATE)
scheduling algorithm to improve response time. The method runs a backup copy of a slow task on a
separate node. Various factors, including high CPU load and slow background processes, can be the
reason behind a task's slow progress.
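
The core LATE heuristic can be sketched as follows: each running task reports a progress score
between 0 and 1, its progress rate is the score divided by the elapsed time, and the remaining time
is estimated as (1 - score) / rate. The task that is expected to finish farthest in the future is
the candidate for a backup copy. This is a simplified sketch; the function names and the input
layout are assumptions, and the caps and thresholds that the full algorithm applies before
speculating are omitted.

```python
def estimated_time_left(progress, elapsed):
    """Estimate remaining time as (1 - progress) / (progress / elapsed)."""
    return elapsed * (1.0 - progress) / progress


def pick_speculative(tasks):
    """Return the id of the running task with the longest estimated time to end.

    tasks maps a task id to (progress score in (0, 1), elapsed seconds).
    Tasks with no measurable progress or already finished are ignored here.
    """
    estimates = {
        task_id: estimated_time_left(progress, elapsed)
        for task_id, (progress, elapsed) in tasks.items()
        if 0.0 < progress < 1.0 and elapsed > 0
    }
    if not estimates:
        return None
    return max(estimates, key=estimates.get)


if __name__ == "__main__":
    running = {"t1": (0.9, 90.0), "t2": (0.2, 80.0), "t3": (0.5, 30.0)}
    print(pick_speculative(running))
```

Ranking by estimated time to end, rather than by raw progress, favors backing up the task whose
completion most delays the job as a whole.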

Locality-aware reduce task scheduling (LARTS) [62] aims to enhance data locality and, as a result,
minimize network traffic. This study also addressed early shuffle concerns. Although early shuffle
improves performance and reduces turnaround time, it also burdens the network. Therefore, LARTS
starts the shuffle only once a specific portion of the processing is done; the starting point of
the shuffle is called the sweet spot. In 2012, Guo et al. [63] proposed delay scheduling, which
addresses the disadvantage of the fair scheduler by attempting to remove the difficulties of
placing tasks close to their data. When a request for a new task arrives, delay scheduling finds
the job that meets the fairness constraints and does not assign a task from it if the locality
conditions are not fulfilled.
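
A minimal sketch of the delay scheduling idea follows. It assumes jobs arrive already sorted in
fairness order, that each pending task is described by the set of nodes holding its input data, and
that a job may be skipped at most max_skips times before it is allowed to run a task without
locality. The dictionary layout and the function name assign_task are hypothetical.

```python
def assign_task(jobs, node, max_skips):
    """Pick a task for a node that has a free slot.

    jobs is a list in fairness order; each job is a dict with keys
    "name", "pending" (list of sets of nodes holding a task's data), and "skips".
    Returns (job name, "local" or "non-local"), or None if nothing is launched.
    """
    for job in jobs:
        # Prefer a pending task whose input data is stored on this node.
        for index, replicas in enumerate(job["pending"]):
            if node in replicas:
                job["pending"].pop(index)
                job["skips"] = 0
                return job["name"], "local"
        if job["pending"]:
            if job["skips"] >= max_skips:
                # The job has waited long enough: give up on locality.
                job["pending"].pop(0)
                return job["name"], "non-local"
            # Skip this job for now and let a later job use the slot.
            job["skips"] += 1
    return None


if __name__ == "__main__":
    queue = [
        {"name": "A", "pending": [{"n2"}], "skips": 0},
        {"name": "B", "pending": [{"n1"}], "skips": 0},
    ]
    print(assign_task(queue, "n1", max_skips=2))
```

In this example job A is first in fairness order but has no data on node n1, so it is skipped and
job B receives the slot with a local task.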

Table 2 presents a comprehensive comparison of the discussed techniques with other approaches. 
The table provides a detailed evaluation of various factors, such as performance metrics, scalability, resource 
utilization, and adaptability. By comparing the discussed techniques with alternative methods, this analysis 
offers insights into the strengths and limitations of each approach, aiding researchers and practitioners in 
selecting the most suitable scheduling technique for their specific requirements. 


Table 2. Scheduler techniques comparison table

Technique                                                     Resolved issues             Throughput  Response time  Execution time  Energy efficient
Dynamic proportional share scheduler [59]                     Fairness                    ✗           ✗              ✓               N/A
Longest approximate time to end (LATE) [61]                   Speculative execution       ✗           ✓              ✗               N/A
Delay scheduling [63]                                         Data locality and fairness  F,          F.             ✗               N/A
Cost-effective scheduling technique [60]                      Data locality               ✓           ✗              ✗               N/A
Locality-aware reduce task scheduling (LARTS) [62]            Data locality               ✗           ✓              ✗               N/A
Parental prioritization-based task scheduling algorithm [64]  Fairness                    N/A         N/A            ✓               ✗
Modified particle swarm optimization algorithm [63]           N/A                         N/A         N/A            ✓               ✗
A hybrid of genetic and particle swarm optimization [65]      N/A                         ✗           N/A            ✓               ✓


3.3. Computable and decidable 

The computational cost and complexity of scheduling for intricate systems are a genuine concern.
Online scheduling methods should avoid scheduling algorithms with exponential complexity because
of the time they take away from the application software. Furthermore, some scheduling problems are
computationally intractable, making them inappropriate for offline scheduling. Therefore, two
aspects of computational complexity must be considered: computability and decidability.
Computability concerns determining whether a given schedule is feasible, while decidability
concerns assessing whether a feasible schedule exists [40].
4. DATA PROCESSING FRAMEWORKS

In big data analytics, efficient data processing frameworks serve as the backbone of handling and 
analyzing vast amounts of information, which includes the data generated by smart cities. These frameworks 
provide the necessary tools and infrastructure to extract valuable insights from diverse data sources, enabling 
cities to make data-driven decisions and optimize urban life. Hadoop and Spark are two prominent data 
processing frameworks that have revolutionized the field and found extensive applications in smart cities. 


4.1. Spark framework 

Spark is an open-source framework for processing large amounts of data quickly and easily. It
debuted in 2009 at Berkeley, was open-sourced the following year, and was later donated to the
Apache Software Foundation. Iterative algorithms in machine learning, interactive data analysis
tools, and graph algorithms are all examples of applications that repeatedly reuse the same
data [66]. The Spark framework [67] was developed to accommodate these programs while retaining the
scalability and fault tolerance of the MapReduce framework. Spark's two primary abstractions for
parallel programming are resilient distributed datasets (RDDs) [68] and parallel operations on
these datasets, which are invoked by passing a function to apply to a dataset. An RDD is a
read-only collection of items partitioned across many computers that can be rebuilt quickly if a
partition is lost. Spark allows the user to keep an RDD in the machines' memory and run parallel
operations on it, in the style of MapReduce, many times. As a result, Spark excels in processing
iterative algorithms on datasets [69], [70].
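
The two properties described above, reuse of an in-memory dataset and rebuilding of lost data from
the recorded transformations, can be illustrated with a toy model in plain Python. ToyRDD is a
hypothetical class written for this illustration only; it is not Spark's API, it ignores
partitioning and distribution entirely, and its evict method merely simulates the loss of the
in-memory copy.

```python
class ToyRDD:
    """Toy stand-in for an RDD: read-only, lazy, and rebuilt from its lineage."""

    def __init__(self, source=None, transform=None, parent=None):
        self._source = source        # base data (root dataset only)
        self._transform = transform  # function applied to the parent's items
        self._parent = parent        # lineage: where this dataset comes from
        self._keep_in_memory = False
        self._cached = None
        self.computations = 0        # how often this dataset was computed

    def map(self, fn):
        return ToyRDD(transform=lambda items: [fn(x) for x in items], parent=self)

    def filter(self, predicate):
        return ToyRDD(transform=lambda items: [x for x in items if predicate(x)],
                      parent=self)

    def cache(self):
        """Ask for the result to be kept in memory after the first computation."""
        self._keep_in_memory = True
        return self

    def evict(self):
        """Simulate losing the in-memory copy, for example after a failure."""
        self._cached = None

    def collect(self):
        if self._cached is not None:
            return list(self._cached)
        self.computations += 1
        if self._parent is None:
            result = list(self._source)
        else:
            # Recompute from the parent by replaying the recorded transformation.
            result = self._transform(self._parent.collect())
        if self._keep_in_memory:
            self._cached = result
        return list(result)


if __name__ == "__main__":
    squares = ToyRDD(range(10)).filter(lambda x: x % 2 == 0).map(lambda x: x * x).cache()
    print(squares.collect(), squares.computations)  # computed once
    print(squares.collect(), squares.computations)  # served from memory
    squares.evict()
    print(squares.collect(), squares.computations)  # rebuilt from lineage
```

Repeated calls are served from memory, which is the behavior that benefits iterative workloads, and
the result after eviction is identical because it is derived again from the same lineage.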


4.2. Hadoop vs Spark 

Aziz et al. [71] analyzed Twitter data using the Spark platform in 2018. It took one second to
explore all the tweets on Spark. Their research centered on examining the execution performance of
the standard Hadoop MapReduce framework and of the Apache Spark framework. Experimental simulations
were also run to process real-time data using Spark and Hadoop. In addition, the constraints and
benefits of Hadoop for real-time processing were discussed. Finally, the simulations compared the
speed of both frameworks. The results show that Spark is a powerful tool for processing real-time
data streams.

In 2017, Hazarika et al. [72] evaluated the theoretical and practical differences between the Spark
and Hadoop systems. In their experiments, repeated queries such as logistic regression benefit from
Spark's cache and run significantly quicker. On the other hand, Spark's performance is weak for
nonrepetitive queries because of the small cache size. Small iterations, however, benefit
considerably from Hadoop's speed.

In 2015, Gopalani and Arora [73] examined two large data processing frameworks, Hadoop and Spark.
They implemented the K-means algorithm, a fundamental machine learning technique, on both Hadoop
and Spark using a dataset of sensor data and compared the two platforms' execution times. The data
showed that Spark performed better than Hadoop in real-world scenarios. Furthermore, Gu and Li [74]
compared the memory needs and processing times of the Hadoop and Spark systems by running the
PageRank algorithm on several network datasets. According to the findings, Spark used more memory
but took less time to execute; notably, Spark was 73% faster than Hadoop when dealing with massive
datasets.

In 2013, Zaharia et al. [75] used logistic regression to examine the Hadoop and Spark frameworks.
The authors focused on a class of applications that reuse a working set of data across multiple
parallel operations. These include many iterative machine-learning algorithms and interactive data
analysis tools. Spark introduces an abstraction known as resilient distributed datasets (RDDs) to
support such applications. Spark can beat Hadoop by a factor of ten in iterative machine learning
tasks, and it can be used interactively to query a 39 GB dataset with a response time of less than
one second. According to the findings of this article, Spark is the preferable option.

In 2014, Liang et al. [76] compared Hadoop, Spark, and DataMPI in terms of execution speed, memory
footprint, and central processing unit consumption. The authors used BigDataBench, a benchmark
suite for large data sets, to conduct in-depth analyses of the performance and resource use of
Spark, DataMPI, and Hadoop. In these investigations, DataMPI delivered a 57% improvement over Spark
and a 50% improvement over Hadoop in job execution time. DataMPI's main advantages were its
efficient communication mechanisms and its high throughput. In addition, DataMPI makes better use
of resources (disk, CPU, network I/O, and memory) than the other two frameworks. As a result,
DataMPI outperformed both Spark and Hadoop, and Spark in turn surpassed Hadoop.

Mavridis and Karatza [77] assessed the performance of log file analysis using the cloud computing
frameworks Apache Hadoop and Apache Spark. The authors enhanced the log file analysis in both
frameworks so that they can handle real-world data from the Apache Web Server. They also conducted
further tests with varied parameters to evaluate and contrast the two frameworks. Im and
Moseley [78] used MapReduce to examine conditional lower bounds on graph connectedness. This
research identified problems that do not allow efficient external algorithms to be integrated into
MapReduce. The study also addresses a fundamental research question: how to tell whether a graph
has a closed cycle. In particular, they examine the issue of designing an algorithm to determine
whether or not two unconnected processes exist in a given network. This challenge aims to verify
the graph's global structure when all local graph parts are equivalent. They identify the natural
class of algorithms that can only transfer, store, and process data and information in paths, and
prove that no randomized algorithm in this class can answer the question in a sublogarithmic number
of rounds. Kodali et al. [79] work on a k-NN-based method using MapReduce for meta-path
classification in heterogeneous information networks (HINs). The authors used the PathSim
similarity measure in a HIN to classify the meta-paths uncovered by applying the well-known
MapReduce paradigm to the problem of locating k-nearest neighbors. Moreover, they devised a
classification technique to deal with the massive data found in HINs using MapReduce.

Wang et al. [80] studied MapReduce task scheduling under high energy consumption in 
heterogeneous clusters. The result was a task scheduling framework for heterogeneous clusters that considers 
resource utilization, deadlines, and data locality to keep energy costs to a minimum. The framework includes 
updates to the slot list, new task lists, and scheduling. In addition, they proposed a novel job-sequencing step 
that builds an ordered list of jobs and tasks based on factors such as expected processing times, available job 
slots, and due dates. Wei et al. [81] introduced a MapReduce-centric clustering method for handling large 
datasets. Their study compared the MapReduce implementation of the Canopy method with the widely used 
K-means algorithm. By evaluating their performance and effectiveness in clustering large datasets, 
Wei et al. [81] shed light on the advantages and limitations of these approaches. 
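
To make the Canopy method mentioned above more concrete, the following is a minimal single-machine sketch of 
canopy formation in Python. It is not the MapReduce implementation evaluated by Wei et al. [81]; the function 
name `canopy`, the Euclidean distance, and the threshold values `t1` and `t2` are assumptions chosen for 
illustration. 

```python
import math


def canopy(points, t1, t2):
    """Group points into (possibly overlapping) canopies.

    t1 is the loose threshold (membership), t2 the tight threshold
    (removal from the candidate list); t1 must be larger than t2.
    Illustrative sketch only, not the implementation from [81].
    """
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)      # pick the next candidate as a canopy center
        members = [center]
        still_remaining = []
        for p in remaining:
            d = math.dist(center, p)   # Euclidean distance (assumed metric)
            if d < t1:
                members.append(p)      # loosely close: joins this canopy
            if d >= t2:
                still_remaining.append(p)  # not tightly close: may seed or join others
        remaining = still_remaining
        canopies.append((center, members))
    return canopies


demo_points = [(0, 0), (1, 0), (5, 0), (6, 0), (2.5, 0)]
demo_canopies = canopy(demo_points, t1=3.0, t2=1.5)
```

In this toy run the point (2.5, 0) lies within `t1` of two centers, so it belongs to more than one canopy. Canopy 
centers of this kind are commonly used as cheap initial centers for a subsequent K-means pass. 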

In a related study, Roger et al. [82] proposed a preemptive fair scheduler policy for the Disco 
MapReduce framework. They explored how the preemptive fair scheduler policy affected job execution 
times in both production and research settings. While the policy reduced execution times for production 
jobs, it had a negative impact on research jobs. The authors provided insights into the trade-offs and 
considerations involved in implementing the preemptive fair scheduler policy. 

Jang et al. [83] investigated k-nearest neighbor (k-NN) input initialization for neural network 
inversion. This study presents a new way of initializing the input variables of a neural network based on 
the k-NN technique. The proposed method finds the inputs in the training dataset whose outputs are close to 
a target output and combines them to form the starting input variables. Chen et al. [84] performed fast 
density peak clustering for large-scale data based on kNN. The proposed method computes density with a fast 
kNN algorithm such as a cover tree, which significantly improves on the previous way of computing density. 
It uses the kNN-density and a fast procedure to differentiate between local and nonlocal density peaks. 
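
The k-NN input initialization described for Jang et al. [83] can be sketched as follows. This is a minimal 
illustration under stated assumptions, not the authors' implementation: the function name 
`knn_initial_input` is hypothetical, outputs are compared with Euclidean distance, and the selected inputs 
are combined by a plain average because the summary above does not specify the combination rule. 

```python
import math


def knn_initial_input(train_inputs, train_outputs, target_output, k=3):
    """Return a starting input for network inversion.

    Picks the k training samples whose outputs are closest to
    target_output and averages their inputs (averaging is an assumption).
    """
    if not 1 <= k <= len(train_inputs):
        raise ValueError("k must be between 1 and the number of samples")
    # Rank sample indices by distance between their output and the target.
    ranked = sorted(
        range(len(train_inputs)),
        key=lambda i: math.dist(train_outputs[i], target_output),
    )
    nearest = ranked[:k]
    dim = len(train_inputs[0])
    # Component-wise mean of the k selected input vectors.
    return [sum(train_inputs[i][j] for i in nearest) / k for j in range(dim)]


demo_inputs = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [10.0, 10.0]]
demo_outputs = [[0.0], [1.0], [2.0], [10.0]]
demo_start = knn_initial_input(demo_inputs, demo_outputs, [1.2], k=2)
```

The returned vector would then serve as the initial point for whatever iterative inversion procedure is used, 
in place of a random initialization. 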

Janardhan and Samuel [85] investigated the optimal degree of parallelism for Spark running on 
Hadoop yet another resource negotiator (YARN) in order to get the most out of the cluster's resources. This 
research suggests the best parallelism configuration for an Apache Spark deployment on a Hadoop YARN 
cluster. The recommendations rest on experiments that examine how performance depends on parallelism 
at each level of a Spark application. A zone-based resource allocation technique called Zebras has also been 
proposed to enhance Spark's efficiency in a heterogeneous cluster; by optimizing resource utilization and 
allocation within the Spark cluster, it improves overall performance. 
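
As a rough illustration of what a parallelism configuration for Spark on YARN involves, the sketch below 
derives executor settings from cluster size. It follows a commonly cited rule of thumb (about five cores per 
executor, one core and some memory reserved per node, one executor slot left for the YARN application 
master), not the configuration recommended in [85]. The function name `size_executors` and all default 
values are assumptions; the dictionary keys are standard Spark configuration properties. 

```python
def size_executors(nodes, cores_per_node, mem_gb_per_node,
                   cores_per_executor=5, reserved_cores=1,
                   reserved_mem_gb=1, overhead_fraction=0.10):
    """Rule-of-thumb executor sizing for Spark on YARN (illustrative only)."""
    usable_cores = cores_per_node - reserved_cores
    executors_per_node = usable_cores // cores_per_executor
    if executors_per_node < 1:
        raise ValueError("not enough cores per node for one executor")
    # Leave one executor slot for the YARN application master.
    total_executors = executors_per_node * nodes - 1
    if total_executors < 1:
        raise ValueError("cluster too small for this heuristic")
    mem_per_executor = (mem_gb_per_node - reserved_mem_gb) / executors_per_node
    # Keep part of the container for off-heap memory overhead.
    heap_gb = int(mem_per_executor * (1 - overhead_fraction))
    return {
        "spark.executor.instances": total_executors,
        "spark.executor.cores": cores_per_executor,
        "spark.executor.memory": f"{heap_gb}g",
        # Two tasks per core is a common starting point, to be tuned empirically.
        "spark.default.parallelism": total_executors * cores_per_executor * 2,
    }


demo_conf = size_executors(nodes=10, cores_per_node=16, mem_gb_per_node=64)
```

Values produced this way are only a starting point; the studies above indicate that the best setting depends 
on the workload and should be confirmed by measurement. 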

Hussain and Surendran [86] explored efficient content-based, fast-response image retrieval using 
the MapReduce and Spark frameworks. The authors leverage the MapReduce model to efficiently sign and 
index massive volumes of images, enabling fast retrieval based on content. Furthermore, in 2021, 
Mostafaeipour et al. [87] adopted Spark as a proportional method for recovering the index, operating on top 
of the MapReduce framework and utilizing the Hadoop distributed file system (HDFS). Their work focuses 
on efficient index recovery using Spark's capabilities within the MapReduce ecosystem. 

In addition to the insights provided, Table 3 further reinforces the key differences between Spark 
and Hadoop. The table highlights specific research gaps and indicates whether each gap is present in Spark or 
Hadoop. This comprehensive comparison aids in understanding the unique strengths and limitations of each 
framework, enabling researchers and practitioners to make informed decisions regarding their data 
processing needs in the context of smart cities. 

Table 3. Data processing aspects comparison table 
Aspect                              Spark    Hadoop 
Performance optimization 
Cache and query optimization 
Memory usage and processing time 
Machine learning performance 
Resource utilization 
Log file analysis performance 
MapReduce efficiency 
Energy efficiency 

5. CONCLUSION 

The world is moving toward an era of smart cities that depend heavily on IoT and web applications. 
Smart cities are gaining popularity because they positively impact a country’s economy. Intelligent and rapid 
decision-making is a critical requisite of a sophisticated smart city system. At the same time, such a system 
generates large volumes of data, known as big data and characterized by the 3 V’s (volume, velocity, and 
variety), whose handling has been recognized as a major problem. New ideas, strategies, and frameworks 
must be introduced to overcome the issue of handling and scheduling big data. This article provides an 
overview of a thorough study of work on scheduling techniques in the Hadoop and Spark environments. 
Dynamic scheduling is crucial to achieving high performance in big data processing. Data volume, diversity, 
velocity, security and privacy, cost, connectivity, and data sharing are just a few of the difficulties with 
big data. From the conducted review, it can be said that Hadoop, as the baseline, is adequate when the data is 
static and it is possible to wait until batch processing is finished. However, Spark has an advantage in 
real-time data processing and parallelism. Extensive research is still needed before concluding that Spark is 
the only solution for analyzing real-time streaming data. 

Additionally, as demonstrated in the study, Spark can evaluate data quickly. Spark is an in-memory 
processing technology that enables real-time processing of massive amounts of streaming data. Compared to 
Hadoop, Apache Spark supports a wider range of needs, including batch, streaming, and real-time 
processing. In the future, schedule optimization can be done for Hadoop. For Spark, it can be done by 
modifying default parameter configuration settings, introducing new scheduling techniques, and applying 
hybrid artificial intelligence scheduling. 
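
As one hedged illustration of what modifying default parameter settings could look like in practice, the 
Python sketch below renders `--conf` flags for `spark-submit` from a set of overrides, keeping only those 
that differ from the defaults. The helper name `conf_flags` and the choice of properties are assumptions made 
for illustration; the default values listed are the documented Spark defaults as far as we are aware and should 
be verified against the Spark version in use. 

```python
# Documented defaults for a few scheduling-related Spark properties
# (assumed current; check the configuration page of your Spark version).
SPARK_DEFAULTS = {
    "spark.scheduler.mode": "FIFO",
    "spark.speculation": "false",
    "spark.dynamicAllocation.enabled": "false",
    "spark.sql.shuffle.partitions": "200",
}


def conf_flags(overrides):
    """Return spark-submit flags for overrides that differ from the defaults."""
    changed = {k: v for k, v in overrides.items()
               if SPARK_DEFAULTS.get(k) != v}
    flags = []
    for key in sorted(changed):          # sorted for a stable, readable order
        flags += ["--conf", f"{key}={changed[key]}"]
    return flags


demo_flags = conf_flags({
    "spark.scheduler.mode": "FAIR",      # switch from FIFO to fair scheduling
    "spark.speculation": "false",        # same as default, so omitted
    "spark.sql.shuffle.partitions": "64",
})
```

A scheduler-tuning experiment could generate such flag lists for each candidate configuration and compare 
the resulting job completion times. 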